You need to do this only once!
The script below checks whether the listed CRAN packages are installed
on your computer. If they are not, they will be automatically
installed.
## First specify the packages of interest
packages = c("devtools","hdf5r","dplyr","ggplot2","stringr",
"RColorBrewer","useful","readr","BiocManager")
## Now load or install&load all
package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  }
)

Seurat for scRNAseq data analysis
install.packages('Seurat')

The following is necessary for interaction with hdf5 objects, and can only be installed using the BiocManager installer.
# You need this only once
BiocManager::install("rhdf5")

In this COO, you will learn how to analyse single-cell RNA sequencing (scRNAseq) data using the well-established Seurat pipeline.
This pipeline includes functions that do most of the work ‘behind the scenes’. This makes the tool very user friendly and suitable for today’s tutorial. There is extensive documentation on the use of the pipeline. The following tutorial is the closest to what we will do today: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
You may need to run this in the terminal, since in RStudio the working directory is reset to your home directory after the chunk is run! An alternative is to use the pull-down menu at the top of the bottom-right window in RStudio: click Files/More to see the option.
setwd("/Users/onurbasak/Documents/1_BCRM/11 teaching/Bachelor/Bioinformatics_BMW/2023/COO/")
library(dplyr,verbose = FALSE)
library(ggplot2,verbose = FALSE)
library(stringr,verbose = FALSE)
library(Seurat,verbose = FALSE)
library(RColorBrewer,verbose = FALSE)
library(useful,verbose = FALSE)
library(readr,verbose = FALSE)
library(hdf5r,verbose = FALSE)

For today’s tutorial, we will use the scRNAseq atlas of the adolescent mouse cortex published by Saunders et al. 2018. This dataset is extensive and is available at mousebrain.org, which we briefly discussed at the end of the lecture. There is an online tool with which you can browse the data. You can do this, for instance, to get inspiration for your practice questions.
\(~\)
Why? A major use of scRNAseq is cell type identification. To achieve this, you need to perform quality control steps and cluster analysis. To visualize the data, you will perform dimensionality reduction. Finally, you can plot marker genes that you find in the literature to reveal the cell type identity of the clusters. \(~\)
It is time for the scRNAseq analysis! We will use the Seurat object that we have uploaded. This object, made specially for the Seurat pipeline, has a lot of ‘slots’ where information can be stored, and its architecture is a bit complicated. You do not need to know the details for this tutorial, beyond what is mentioned here.
The data (‘Linnerson_cortex_10X22.rds’) was downloaded and processed into a ‘Seurat object’ beforehand, to prevent technical errors caused by the ‘loom’ file format on several computers.
You can download them from Blackboard, in the COO/data folder, or from the following Github page made for this course: https://github.com/neurOnur/BMW_scRNAseq_COO_2023
Load the Seurat object that was saved as rds.
dataset <- readRDS(file = 'data/Linnerson_cortex_10X22.rds')
dataset
## An object of class Seurat
## 27998 features across 6658 samples within 1 assay
## Active assay: RNA (27998 features, 0 variable features)
Note: The object contains data from 6658 cells (samples) and 27998 features (genes). There is 1 assay (RNA).
To save time, we can subset the Seurat object by selecting 1000 random cells. For this, we can use the subset() function.
# Run only if you have performance issues
# dataset_backup = dataset
# dataset <- subset(dataset, downsample = 1000)
# dataset

This is where ‘cell level’ information is stored. This means there is one value for each cell.
kable(head(dataset@meta.data[,1:6]),digits = 6)

|  | orig.ident | nCount_RNA | nFeature_RNA | Age | AnalysisPool | AnalysisProject |
|---|---|---|---|---|---|---|
| cortex1_10X22_2_ACAATAACTGTAGC-1 | 10X22 | 5030 | 1942 | p23 | Cortex1 | Adolescent |
| cortex1_10X22_1_AGGTTCGAAACGTC-1 | 10X22 | 2122 | 1010 | p23 | Cortex1 | Adolescent |
| cortex1_10X22_1_TGACTTTGGCATAC-1 | 10X22 | 2677 | 1219 | p23 | Cortex1 | Adolescent |
| cortex1_10X22_2_GTGAGGGACGTAAC-1 | 10X22 | 2648 | 1232 | p23 | Cortex1 | Adolescent |
| cortex1_10X22_2_CACCGTTGGACGTT-1 | 10X22 | 4586 | 1939 | p23 | Cortex1 | Adolescent |
| cortex1_10X22_2_GAGCGAGAACGTAC-1 | 10X22 | 5807 | 2292 | p23 | Cortex1 | Adolescent |
An important metric is the number of RNA molecules (nCount_RNA) and genes (nFeature_RNA) per cell. These are automatically calculated when the Seurat object is generated from a data matrix.
VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA"), cols = "blue",
        pt.size = .01)

Start by generating additional QC metrics beyond the number of genes/features.
Mitochondrial RNA is the mRNA that is generated by the mitochondrial genome. Normally, these constitute a small fraction of the total mRNA. However, in dying or damaged cells, while cytoplasmic/nuclear mRNA degrades rapidly, mitochondrial mRNA is rather well preserved. Thus, a high ratio of mitochondrial mRNA indicates BAD cell quality.
mRNA coding for the ribosomal subunit proteins is abundant (not to be confused with rRNA, which does not code for protein but is part of the ribosome complex). Usually, a high ribosomal mRNA percentage indicates production of a lot of protein, and it is very high in dividing cells or in secretory cells that need to constantly produce proteins. However, if most of the mRNA (>30-50%) that we detect is ribosomal, the valuable information in this cell would be very limited and we should exclude it from the analysis.
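The PercentageFeatureSet() calls below boil down to a simple per-cell percentage. Here is a minimal base R sketch on a toy count matrix (the gene and cell names are made up for illustration):

```r
# Toy count matrix: 4 genes x 3 cells; genes starting with "mt-" are mitochondrial
counts <- matrix(c(10, 90, 50, 50,
                   5, 95, 20, 80,
                   50, 50, 10, 90),
                 nrow = 4,
                 dimnames = list(c("mt-Nd1", "Actb", "mt-Co1", "Gapdh"),
                                 c("cell1", "cell2", "cell3")))

# Percentage of counts coming from mitochondrial genes, per cell
mito <- grepl("^mt-", rownames(counts))
percent.mt <- colSums(counts[mito, , drop = FALSE]) / colSums(counts) * 100
percent.mt
# cell1 cell2 cell3
#  30.0  12.5  30.0
```

Seurat does the same thing on the real count matrix stored inside the object, using the `pattern` argument to pick out the gene names.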
dataset <- PercentageFeatureSet(dataset,pattern='^mt-', col.name = "percent.mt")
dataset <- PercentageFeatureSet(dataset,pattern='Rp(s|l)', col.name = "percent.ribo")

We can use the VlnPlot() function of the Seurat package to visualise the QC metrics.
plot0 <- VlnPlot(object = dataset, features = c("percent.mt", "percent.ribo"),pt.size = 0, cols = "blue")
plot0

Visualize how the mito and ribo percentages change as a function of the number of counts.
plot1 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "percent.mt",pt.size = 2, cols = "blue")
plot2 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "percent.ribo",pt.size = 2, cols = "blue")
plot3 <- FeatureScatter(dataset, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",pt.size = 2, cols = "blue")
plot_null <- ggplot() + theme_void()
(plot1 + plot2) / (plot3 + plot_null)

What is the relationship between the total number of RNA molecules per cell (nCounts) and
i) the mitochondrial RNA percentage?
ii) the ribosomal RNA percentage?
iii) the number of features? \(~\)
We do not want to have low quality cells in our data. Looking at the plot, determine which cells to get rid of. \(~\)
\(~\)
Use subset() to fetch the cells that fit in your description
cutoff_mito = ##ENTER A VALUE HERE##
cutoff_ribo = ##ENTER A VALUE HERE##
dataset <- subset(x = dataset, subset = percent.mt < cutoff_mito & percent.ribo < cutoff_ribo)

In Seurat, the standard preprocessing workflow is wrapped into single commands. However, it is good to see each step to understand what it does. First, we will normalize the data. This is to get rid of the differences in total RNA counts between cells. In other words, we will equalize the total count number in each cell to a fixed number (e.g. 10000 RNA molecules per cell).
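Under the hood, LogNormalize is simple arithmetic: divide each cell by its total count, multiply by the scale factor, and take the natural log of (value + 1). A base R sketch on a toy matrix (made-up values):

```r
# Toy matrix: 2 genes x 2 cells with very different totals (400 vs 200)
counts <- matrix(c(100, 300,
                   50, 150),
                 nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("cell1", "cell2")))

scale.factor <- 10000
# Per-cell scaling to a fixed total, then log1p
norm <- log1p(t(t(counts) / colSums(counts)) * scale.factor)
norm["geneA", "cell1"]  # log1p(100 / 400 * 10000) = log1p(2500)
```

After this step, a gene occupying the same fraction of two cells' transcriptomes gets the same normalized value, regardless of how deeply each cell was sequenced.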
dataset <- NormalizeData(object = dataset, normalization.method = "LogNormalize", scale.factor = 10000)

We want to find the ‘informative genes’ that explain biological differences, to use in some of the downstream applications. If a gene is expressed everywhere, that doesn’t tell us much. However, if a gene is expressed in only a subset of cells, this will cause ‘variation’.
We can detect these genes using the FindVariableFeatures() function.
## Here, we select the top 1,000 highly variable genes (Hvg) for downstream analysis.
dataset <- FindVariableFeatures(object = dataset, selection.method = 'mean.var.plot', mean.cutoff = c(0.0125, 3), dispersion.cutoff = c(0.5, Inf))
length(x = VariableFeatures(object = dataset))
## [1] 8532
Identify the 10 most highly variable genes.
top10 <- head(VariableFeatures(dataset), 10)
top10
## [1] "Hba-a1" "Hbb-bs" "Ptgds" "Hbb-bt" "Apoe" "Npy" "Sst"
## [8] "Gm26778" "Erbb2" "Mt1"
Now visualise.
## Plot
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(dataset)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2

Note: Dispersion indicates variation, while the red colour marks ‘significantly variable’ genes.
Print the top 1000 highly variable genes.
hv.genes <- head(rownames(HVFInfo(object = dataset)), 1000)
head(hv.genes,n=100) # list the first 100
## [1] "Cd79b" "Tbxas1" "Tifab" "Itpripl2" "Notum" "Acvrl1"
## [7] "Rad54b" "Mill2" "Hcn2" "Tnfaip6" "Cldn11" "Ermn"
## [13] "Gjb1" "Car14" "Lpar1" "Gsn" "Trf" "Serpinb1a"
## [19] "Itgb4" "Opalin" "Gpr37" "Aspa" "Mag" "Mog"
## [25] "Tmem88b" "Plekhh1" "Dock10" "Tspan2" "Prr5l" "Gjc3"
## [31] "Hapln2" "Ppp1r14a" "Mal" "Ugt8a" "Fa2h" "Arsg"
## [37] "Phldb1" "Gpr17" "Sema5a" "Kif19a" "Cd9" "Bcas1"
## [43] "Tmem163" "Tcf7l2" "Enpp6" "Rassf10" "Col9a3" "Pak4"
## [49] "Rnd2" "Myrf" "Bfsp2" "Il23a" "Ndrg1" "Fgfr2"
## [55] "Ttyh2" "Anln" "Trim59" "Plekhg3" "Padi2" "Pacs2"
## [61] "Cdh19" "Ms4a7" "F13a1" "Lyz2" "Pf4" "Clec4n"
## [67] "Ccl24" "Clec4a2" "Ccl12" "Cyba" "Cbr2" "Cd209f"
## [73] "Mrc1" "Fcer1g" "Cd14" "Cd37" "Hpgds" "Arl11"
## [79] "Stac" "Hk3" "Spi1" "Fcrls" "Slc2a5" "Lcp2"
## [85] "Slc39a12" "Trem2" "Hmha1" "Tgfbr1" "Ext1" "Pld4"
## [91] "AI662270" "Dock2" "Abi3" "Lrrc25" "Cd53" "Cyth4"
## [97] "Siglech" "Adap2os" "Fcgr1" "Fcgr2b"
We discussed scaling (standardization) in the previous section. Here, we will only take the highly variable genes (hv.genes) to scale and use in downstream dimensionality reduction.
We can also get rid of confounding factors at this point. These factors introduce ‘technical noise’ into the data. For instance, the number of reads per cell can influence the amount of information in a cell and make it seem different from another cell with low RNA levels, even though they are similar cells.
We will use the ScaleData() function of the Seurat package. The confounding factors can be regressed out using `vars.to.regress`.
dataset <- ScaleData(object = dataset, features = hv.genes, vars.to.regress = c("percent.mt","nFeature_RNA"))
## Regressing out percent.mt, nFeature_RNA
## Centering and scaling data matrix
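The centering and scaling part of this step is ordinary standardization: each gene ends up with mean 0 and standard deviation 1 across cells (the regression of confounders is a separate step that happens first). A base R sketch on random toy data:

```r
# Toy normalized matrix: 4 genes x 5 cells, made-up values
set.seed(1)
norm <- matrix(rnorm(20, mean = 5), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))

# Standardize each gene (row) across cells: subtract the mean, divide by the sd
scaled <- t(scale(t(norm)))

rowMeans(scaled)       # ~0 for every gene
apply(scaled, 1, sd)   # 1 for every gene
```

This puts highly and lowly expressed genes on the same footing, so that PCA is not dominated by the genes with the largest absolute expression values.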
Performing dimensionality reduction in Seurat is very simple! We can first calculate the PCA.
dataset <- RunPCA(object = dataset, features = hv.genes, verbose = FALSE, npcs = 50)
plot1 <- DimPlot(object = dataset, reduction = 'pca', dims = c(1,2))
plot1

It is also easy to find the genes that contribute to the + and - directions of the top PCs! We can also plot the level of contribution.
print(dataset[["pca"]], dims = 1:5, nfeatures = 5) # First box
## PC_ 1
## Positive: Cldn5, Flt1, Ly6c1, Pglyrp1, Ly6a
## Negative: C1qb, C1qa, C1qc, Ctss, P2ry12
## PC_ 2
## Positive: Cldn11, Mog, Tspan2, Fa2h, Ugt8a
## Negative: C1qa, Ctss, C1qb, Tyrobp, C1qc
## PC_ 3
## Positive: Aldoc, Gpr37l1, Gja1, Slc1a3, Mfge8
## Negative: Mog, Cldn11, Mal, Trf, Ppp1r14a
## PC_ 4
## Positive: Cntnap5a, Calb1, Lamp5, March1, Tmem59l
## Negative: Slc1a3, Gpr37l1, Plpp3, Aldoc, Mfge8
## PC_ 5
## Positive: Calb1, Fam19a1, Lamp5, Rasgrf2, Cntnap5a
## Negative: Rprm, Hs3st4, Nxph3, Foxp2, Syt6
VizDimLoadings(dataset, dims = 1:2, reduction = "pca") # Second box

Some genes contribute more to the variation than others… What do these genes and principal components tell us?
\(~\)
We can use an integrated function of Seurat to plot heatmaps that visualize the genes driving the different principal components.
PCHeatmap(dataset,dims = 1:6,ncol =2)

Plot some of these genes on PCA plots to see how their expression is distributed along the different PCs. For this, we will look at the first 4 PCs, identify the genes that score highest on the + and - axes, and plot them on the PC1/PC2 and PC3/PC4 plots.
The first two components:
PCHeatmap(dataset,dims = c(1,2),ncol =2)

The Aldoc and Mog genes look interesting:
plot_f1 <- FeaturePlot(dataset,features = c('Aldoc','Mog'),reduction = 'pca',dims = c(1,2),cols = c('gray','blue','red'))
plot_f2 <- FeaturePlot(dataset,features = c('Aldoc','Mog'),reduction = 'pca',dims = c(3,4),cols = c('gray','blue','red'))
plot_f1 / plot_f2

Now components 3 and 4:
PCHeatmap(dataset,dims = c(3,4),ncol =2)

C1qa and Ly6c1 genes look interesting:
plot_f3 <- FeaturePlot(dataset,features = c('Cldn1','Ly6c1'),reduction = 'pca',dims = c(1,2),cols = c('gray','blue','red'))
plot_f4 <- FeaturePlot(dataset,features = c('C1qa','Ly6c1'),reduction = 'pca',dims = c(3,4),cols = c('gray','blue','red'))
plot_f3 / plot_f4

Check principal components 10, 20, 30, 40 and 50.
- What differences do you see?
- Would you include all principal components for downstream analysis? Why/why not? \(~\)
\(~\)
We will use a graph-based clustering algorithm discussed in the lecture.
We first need to build a neighborhood graph. In this network graph, each cell is a node, and the similarity between cells in high-dimensional space defines the edges.
One could say that cells closest to each other reside in a neighborhood.
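The idea behind building that graph can be sketched in base R: compute pairwise distances between cells in PC space and keep, for each cell, its k nearest other cells as neighbours (toy coordinates; k = 3 chosen arbitrarily — Seurat's own implementation is more sophisticated):

```r
# Toy "PC coordinates": 10 cells x 3 principal components
set.seed(2)
pcs <- matrix(rnorm(30), nrow = 10)

d <- as.matrix(dist(pcs))   # Euclidean distance between every pair of cells

k <- 3
# For each cell (row), the indices of its k nearest neighbours;
# order(row)[1] is the cell itself (distance 0), so we skip it
knn <- t(apply(d, 1, function(row) order(row)[2:(k + 1)]))
dim(knn)  # 10 cells x 3 neighbour indices each
```

FindNeighbors() then refines this into a shared-nearest-neighbour (SNN) graph, and FindClusters() looks for densely connected communities in it.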
dataset <- FindNeighbors(object = dataset, dims = 1:20)
## Computing nearest neighbor graph
## Computing SNN
dataset <- FindClusters(object = dataset, resolution = 0.6) # changing the resolution will change the size/number of clusters! e.g. c(0.6, 0.8, 1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6658
## Number of edges: 246566
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9164
## Number of communities: 18
## Elapsed time: 0 seconds
VlnPlot(object = dataset, features = c("nCount_RNA", "nFeature_RNA"))

We will use the top PCs to calculate the UMAP and tSNE coordinates. You can change the number of PCs based on your observations in Practice 2.
dataset <- RunUMAP(object = dataset, reduction = "pca", dims = 1:20, verbose = FALSE)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
dataset <- RunTSNE(object = dataset, dims = 1:20, verbose = FALSE)

Do you want to see them? Adjust the DimPlot function to show the PCA, UMAP and t-SNE results! \(~\)
Include the cluster colors to visualize clusters. \(~\)
How do the clusters distribute on different plots?
\(~\)
Why is there a difference? \(~\)
Use Google or PubMed to find marker genes for neurons, inhibitory neurons, astrocytes and oligodendrocytes.
Hint: The following papers have marker genes for relevant cell types:
Plot the expression of these marker genes using the VlnPlot() function. \(~\)
The following parts are extras. If you have time, please work through them.
The following is a question from last year’s exam. We didn’t go into the details of k-means this year, so the question may look out of context, but it will give you an idea of what can be expected. The answers are indicated with ‘A:’.
Please list four important facts about the k-means algorithm on the following topics: i) What is it used for? > A: For classification of data into groups/clusters
Please explain the important steps > A: Determine the expected number of clusters. Take random points in the data and calculate the distance of each data point to these random points to assign clusters. Then calculate the center of each cluster. Finally, repeat this process until there is no change in the central points, meaning that stability is reached.
Name one of the drawbacks > A: needs estimation of the number of clusters beforehand. Cannot work on all different ‘shapes’ of data. Cannot find outliers. Does not show how similar each cluster is.
If you run the k-means algorithm on the same data two different times, do you always get the same results? Why/why not? > A: No. The process is stochastic, starts randomly and is repeated many times until stability is reached. The final results will, in most cases, be different.
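The stochasticity is easy to see with the built-in kmeans() function on toy data (a sketch; the two blobs and seeds are made up for illustration):

```r
# Two well-separated blobs of 25 points each
set.seed(42)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),   # blob 1, near (0, 0)
           matrix(rnorm(50, mean = 4), ncol = 2))   # blob 2, near (4, 4)

km1 <- kmeans(x, centers = 2)  # one random initialization
set.seed(7)
km2 <- kmeans(x, centers = 2)  # a different random initialization

# With well-separated blobs the two partitions usually agree, but the cluster
# LABELS (1 vs 2) can be swapped between runs; with messier data the
# partitions themselves can differ
table(km1 = km1$cluster, km2 = km2$cluster)
```

Increasing the `nstart` argument makes kmeans() try many initializations and keep the best, which makes the result more reproducible.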
Here is another question for you to think about:
Which steps of the single cell RNA sequencing analysis aim at getting rid of the differences in the number of reads between cells (e.g. a cell with 1000 features and another with 5000 features)?
A: I won’t provide the answer for this one
Follow this part if there is time left after the peer-group discussion.
You are not expected to follow/understand this part and none of it will be in the exam. This is purely for people who are curious about the details of the techniques.
t-SNE is a machine learning algorithm that places cells that are similar in high-dimensional space (e.g. with respect to 20,000 genes/dimensions) close to each other in a lower-dimensional space.
More specifically, t-SNE minimizes the divergence between two distributions of pairwise similarities: those of the input objects and those of the corresponding low-dimensional points in the embedding.
The formula looks like this:
source: “https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668”
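For reference, the divergence being minimized is the Kullback–Leibler divergence between the high-dimensional pairwise similarities \(p_{ij}\) and their low-dimensional counterparts \(q_{ij}\):

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Intuitively, the cost is large when two cells that are similar in gene-expression space (large \(p_{ij}\)) end up far apart in the embedding (small \(q_{ij}\)), so the algorithm pulls such cells together.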
You definitely don’t need to know this! But this is all t-SNE is: four steps repeated again and again. The following plot shows the 1000 iterations that the algorithm performs. Note that with each iteration, the clusters become better separated. However, it is still hard to find exactly where to place some of the cells, which continue to jiggle even at the end.
Source: (https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm/)
UMAP is a machine learning algorithm that uses ‘manifolds’ to represent the relationships between data points in high-dimensional space in a lower-dimensional one. Manifolds are built on simplices. Here are some low-dimensional examples.
There is also a formula involved in the calculations, which we won’t show. The algorithm takes a ‘different approach’ than t-SNE to finding the relationships between cells (or other data points/samples).
You can find more information on : “https://umap-learn.readthedocs.io/en/latest/how_umap_works.html”
Is UMAP better than t-SNE, or vice versa? Neither, really. UMAP has advantages when looking at ‘global similarities’, while t-SNE performs well when looking at ‘local interactions’. Both are extremely good and popular methods for visualising high-dimensional data.
The images for info boxes are taken from https://www.cleanpng.com
Also see: https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
https://www.oreilly.com/content/an-illustrated-introduction-to-the-t-sne-algorithm